Gappy TotalReCaller for RNASeq Base-Calling and Mapping

Author

  • Bud Mishra
Abstract

Understanding complex mammalian biology depends crucially on our ability to define a precise map of all the transcripts encoded in a genome, and to measure their relative abundances. A promising assay depends on RNASeq approaches, which build on next-generation sequencing pipelines capable of interrogating cDNAs extracted from a cell. The underlying pipeline starts with base-calling, collects the sequence reads, and interprets the raw reads in terms of transcripts that are grouped with respect to the different splice-variant isoforms of a messenger RNA. We address a very basic problem involved in all of these pipelines, namely accurate Bayesian base-calling, which can combine the analog intensity data with suitable underlying priors on base composition in the transcripts. In the context of sequencing genomic DNA, a powerful approach for base-calling has been developed in the TotalReCaller pipeline. To derive its priors, it uses a suitable reference whole-genome sequence in a compressed self-indexed format. However, TotalReCaller faces many new challenges in the transcriptomic domain, especially since we still lack a fully annotated library of all possible transcripts, and hence a sufficiently good prior. There are many possible solutions, similar to the ones developed for TotalReCaller in applications addressing de novo sequencing and assembly, where partial contigs or string-graphs could be used to bootstrap the Bayesian priors on base composition. A similar approach would be applicable here too: partial assemblies of transcripts can be used to characterize the splicing junctions, or to organize them in incompatibility graphs, and can then serve as priors for TotalReCaller. The key algorithmic techniques for this purpose have been addressed in a forthcoming paper on Stringomics.
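The idea of combining analog intensity data with a prior on base composition can be illustrated with a small sketch. This is not the TotalReCaller implementation: the function name `posterior_base_call`, the normalized four-channel intensities, and the Gaussian noise model with standard deviation `sigma` are all assumptions made here for illustration only.

```python
import math

BASES = "ACGT"

def posterior_base_call(intensities, prior, sigma=0.25):
    """Toy Bayesian base call for a single sequencing cycle.

    intensities: dict base -> observed channel intensity (assumed normalized,
                 so the true base's channel is ideally 1.0 and the others 0.0)
    prior:       dict base -> prior probability, e.g. taken from a reference
    sigma:       assumed standard deviation of independent Gaussian noise
    Returns (most probable base, dict of posterior probabilities).
    """
    log_post = {}
    for b in BASES:
        # Log-likelihood of the four observed channels if the true base is b.
        log_lik = 0.0
        for c in BASES:
            expected = 1.0 if c == b else 0.0
            log_lik -= (intensities[c] - expected) ** 2 / (2 * sigma ** 2)
        # Bayes' rule in log space: log posterior = log likelihood + log prior.
        log_prior = math.log(prior[b]) if prior[b] > 0 else float("-inf")
        log_post[b] = log_lik + log_prior
    # Normalize with the max subtracted for numerical stability.
    m = max(log_post.values())
    weights = {b: math.exp(v - m) for b, v in log_post.items()}
    total = sum(weights.values())
    posterior = {b: w / total for b, w in weights.items()}
    best = max(BASES, key=lambda b: posterior[b])
    return best, posterior
```

When the intensities are ambiguous (for example, equal signal in the A and C channels), the prior decides the call; when one channel clearly dominates, the likelihood term outweighs a moderately skewed prior.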
Here, we address a related but fundamental problem by assuming that we have only a reference genome, with certain intervals marked as candidate regions for open reading frames (ORFs), but not necessarily complete annotations regarding the 5' or 3' termini of a gene or its exon-intron structure. The algorithms we describe find the most accurate base-calls of a cDNA with the best possible segmentation, all mapped to the genome appropriately.
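To make the notion of a "gappy" mapping concrete, the following brute-force sketch places a read on a genome either as one contiguous segment or as two segments separated by a gap (a putative intron), choosing the placement with the fewest mismatches. It is a toy under stated assumptions, not the algorithm of the paper: the function names, the `min_seg` parameter, the restriction to at most one gap, and the use of plain mismatch counts on already-called bases (rather than Bayesian scoring on intensities) are all hypothetical simplifications.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_gapped_mapping(read, genome, min_seg=3):
    """Toy mapping of a cDNA read to a genome with at most one gap.

    Returns (mismatches, segments), where each segment is a tuple
    (genome_start, read_start, length), or None if the read does not fit.
    Ties are resolved in favor of the first placement found, with
    ungapped placements tried before gapped ones.
    """
    n, m = len(genome), len(read)
    best = None
    # Ungapped placements: the whole read against every genome window.
    for i in range(n - m + 1):
        d = hamming(read, genome[i:i + m])
        if best is None or d < best[0]:
            best = (d, [(i, 0, m)])
    # One-gap placements: split the read at k, place the left part at i
    # and the right part at j, leaving at least one skipped genome base.
    for k in range(min_seg, m - min_seg + 1):
        left, right = read[:k], read[k:]
        for i in range(n - k + 1):
            d_left = hamming(left, genome[i:i + k])
            for j in range(i + k + 1, n - (m - k) + 1):
                d = d_left + hamming(right, genome[j:j + m - k])
                if best is None or d < best[0]:
                    best = (d, [(i, 0, k), (j, k, m - k)])
    return best
```

The exhaustive search is cubic in the worst case and only suitable for short strings; a practical implementation would rely on an indexed reference, as the abstract notes for TotalReCaller.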


Similar resources

Gappy Total Recaller: Efficient Algorithms and Data Structures for Accurate Transcriptomics

Understanding complex mammalian biology depends crucially on our ability to define a precise map of all the transcripts encoded in a genome, and to measure their relative abundances. A promising assay depends on RNASeq approaches, which build on next-generation sequencing pipelines capable of interrogating cDNAs extracted from a cell. The underlying pipeline starts with base-calling, collect t...


“Calling People to do Good Deeds” and “Commanding Right and Forbidding Wrong” [Amr Bil Ma'ruf and Nahi 'Anil Munkar] and its Effect on Institutionalization in the Legal System of Islamic Republic of Iran

The holy Quran has put “calling people to do good deeds” [or calling people to an act of benevolence] beside and prior to “commanding right and forbidding wrong” [or commanding beneficence and forbidding maleficence]. However, in many interpretive and jurisprudential books, no sufficient effort has been made to elaborate the distinguishing features between these two and their results, whose manife...


An adaptive decorrelation method removes Illumina DNA base-calling errors caused by crosstalk between adjacent clusters

Base-calling accuracy is crucial for high-throughput DNA sequencing and downstream analysis such as read mapping and genome assembly. Accordingly, we made an endeavor to reduce DNA sequencing errors of Illumina systems by correcting three kinds of crosstalk in the cluster intensity data. We discovered that signal crosstalk between adjacent clusters accounts for a large portion of sequencing err...


Implementation of a Lean Model for Carrying out Value Stream Mapping in a Manufacturing Industry

Value Stream Mapping technique involves flowcharting the steps, activities, material flows, communications, and other process elements that are involved with a process or transformation. In this respect, Value stream mapping helps an organization to identify the non-value-adding elements in a targeted process and brings a product or a group of products that use the same resources through the ma...


Ultrasensitive detection of TCR hypervariable region in solid-tissue RNA-seq data

Characterization of tissue-infiltrating T cell repertoire is critical to understanding tumor-immune interactions and autoimmune disease etiology. We present TRUST, an open-source algorithm for calling the TCR transcript hypervariable CDR3 regions using unselected RNAseq data profiled from solid tissues. TRUST achieved high sensitivity in CDR3 calling even for samples with low sequencin...




Publication date: 2013